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ABSTRACT 

We describe the time evolution of gene expression levels by using a time 
translational matrix to predict future expression levels of genes based on 
their expression levels at some initial time. We deduce the time translational 
matrix for previously published DNA microarray gene expression data sets 
by modeling them within a linear framework using the characteristic modes 
obtained by singular value decomposition. The resulting time translation 
matrix provides a measure of the relationships among the modes and governs 
their time evolution. We show that a truncated matrix linking just a few 
modes is a good approximation of the full time translation matrix. This 
finding suggests that the number of essential connections among the genes 
is small. 
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Introduction 

The development and application of DNA- and oligonucleotide-micro- 
array techniques(l, 2) for measuring the expression of many or all of an 
organism's genes has stimulated considerable interest in using expression 
profiling to elucidate the nature and connectivity of the underlying genetic 
regulatory networks (3-9). Biological systems, whether organismal or sub- 
organismal, are robust, adaptable, and redundant (10). It is increasingly 
apparent that such robustness is inherent in the evolution of networks (11). 
More particularly, it is the result of the operation of certain kinds of bio- 
chemical and genetic mechanisms (12-18). 

Analysis of global gene expression data to group genes with similar 
expression patterns has already proved useful in identifying genes that con- 
tribute to common functions and are therefore likely to be co-regulated (19- 
23). Whether information about the underlying genetic architecture and 
regulatory interconnections can be derived from the analysis of gene expres- 
sion patterns remains to be determined. Both the subcellular localization 
and activity of transcription factors can be influenced by post-translational 
modifications and interactions with small molecules and proteins. These can 
be extremely important from a regulatory perspective, but undetectable at 
the gene expression level, complicating the identification of causal connec- 
tions among genes. Nonetheless, a number of conceptual frameworks for 
modeling genetic regulatory networks have been proposed (3-9). 

Several groups have recently applied standard matrix analysis to large 
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gene expression datasets, extracting dominant patterns or "modes" of gene 
expression change(24-26). It has become evident that the complexity of 
gene expression patterns is low, with just a few modes capturing many of the 
essential features of these patterns. The expression pattern of any particular 
gene can be represented precisely by a linear combination of the modes with 
gene-specific coefficients (25). Furthermore, a good approximation of the 
exact pattern can be obtained by using just a few of the modes, underscoring 
the simplicity of the gene expression patterns. 

In the present communication we consider a simple model in which the 
expression levels of the genes at a given time are postulated to be linear 
combinations of their levels at a previous time. We show that the temporal 
evolution of the gene expression profiles can be described within such a linear 
framework by using a "time translation" matrix which reflects the magnitude 
of the connectivities between genes and makes it possible to predict future 
expression levels from initial levels. The basic framework has been described 
previously, along with initial efforts to apply the model to actual datasets (5, 
7-9). The number of genes, g, typically far exceeds the number of time points 
for which data are available and this makes the problem of determining the 
time translation matrix an ill-posed one. The basic difficulty is that in 
order to uniquely and unambiguously determine the g 2 elements of the time 
translation matrix, one needs a set of g 2 linearly independent equations. 
D'haeseleer et al. (1999) used a non-linear interpolation scheme to guess 
the shapes of gene expression profiles between the measured time points. As 
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noted by the authors, their final results are crucially dependent on the precise 
interpolation scheme and are therefore speculative. Van Someren et al. (9) 
instead chose to cluster the genes and study the interrelationships between 
the clusters. In this situation, it is possible to determine the time translation 
matrix unambiguously, provided that the clustering is meaningful. However, 
most clustering algorithms are based on profile similarity, the biological 
significance of which is not entirely clear. 

Here we construct the time translation matrix for the characteristic 
modes obtained using singular value decomposition (SVD). The polished 
expression data (22) for each gene may be viewed as a unit vector in a 
hyperspace, each of whose axes represents the expression level at a mea- 
surement time of the experiment. The SVD construction ensures that the 
modes correspond to linearly independent basis vectors, a linear combination 
of which exactly describes the expression pattern of each gene. Furthermore, 
this basis set is optimally chosen by SVD so that the contributions of the 
modes progressively decrease as one considers higher order modes (24-26). 

Our results suggest that the causal links between the modes, and thence 
the genes, involve just a few essential connections. Any additional connec- 
tions among the genes must therefore provide redundancy in the network. 
An important corrollary is that it may be impossible to determine detailed 
connectivities among genes with just the microarray data because the num- 
ber of genes greatly exceeds the number of contributing modes. 
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Methods 

It was shown recently (24-26) that the essential features of the gene 
expression patterns are captured by just a few of the distinct characteristic 
modes determined through SVD. In the previous work (25), we treated the 
gene expression pattern of all the genes as a "static" image and derived the 
underlying genome-wide characteristic modes of which it is composed. Here 
we carry out a dynamical analysis, exploring the possible causal relationships 
among the genes by deducing a time translation matrix for the characteristic 
modes defined by SVD. 

In order to deduce the time translation matrix, we consider an exact 
representation (25) of the gene expression data as a linear combination of 
all the r modes obtained from SVD. Each gene is characterized by r gene 
specific coefficients, where r is one less than the number of time points in 
the polished data set (22). The key goal is to attack the inverse problem 
and infer the nature of the gene network connectivity. However, the number 
of time points is smaller than the number of genes and thus the problem is 
underdetermined. Nevertheless, the inverse problem is mathematically well 
defined and tractable if one considers the causal relationships among the r 
characteristic modes obtained by SVD. This is because, as noted earlier, the 
r modes form a linearly independent basis set. 

Let 



Y(t) 



( Mt) \ 

\ X r (t) ) 



(1) 



represent the expression levels of the r modes at time t. Then, mathemati- 
cally, our linear model is expressed as 

Y(t + At) = M ■ Y(t) (2) 

where M is a time-independent r xr time translation matrix which provides 
key information on the influence of the modes on each other. The time step, 
At, is chosen to be the highest common factor among all of the experi- 
mentally measured time intervals so that the time of the j th measurement is 
tj = rijAt, where rij is an integer. For equally spaced measurements, rij = j. 

In order to determine M, we define a quantity Z(t) with the initial 
condition Z(to) = Y(to) and, for all subsequent times, Z determined from 
Z(t + At) = M ■ Z(t). For any integer k, we have 

Z{t + kAt) = M k -Y(t ). (3) 

The r 2 coefficients of M are choosen to minimize the cost function 

CF = ]T \\Y( tj )-Z( tj )f / ^2\\Y( tj )f. (4) 

3 3 

For equally spaced measurements, M can be determined exactly using a 
linear analysis so that CF = 0. For unequally spaced measurements, the 
problem becomes non-linear and it is necessary to deduce M using an opti- 
mization technique such as simulated annealing (27). The outcome of this 
analysis is that the gene expression data set can be re-expressed precisely 
using the r specific coefficients for each gene (a linear combination of the r 
modes with these coefficients gives the gene expression profile) , the rxr time 
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translation matrix, M, deduced as described above, and the initial values of 
each of the r modes. 

Results 

We have determined M, the r x r time translation matrix, for three 
different data sets of gene expression profiles: yeast cell-cycle (CDC15) (20) 
using the first 12 equally spaced time points representing the first two cycles, 
yeast sporulation (21) which has 7 time points and human fibroblast (22) 
which has 13 time points (Table 1). The matrix element Mjj describes the 
influence of mode j on mode i. Specifically, the coefficient Mij multiplied 
by the expression level of gene j at time t contributes to the expression 
level of gene i at time (t + At). A positive matrix element leads to the 
i'th gene being positively reinforced by the j'th gene expression level at a 
previous time. M is determined exactly and uniquely for the yeast cell- 
cycle data. The unequal spacing of the time points in the two other data 
sets precluded an exact solution and M is an approximation derived using 
simulated annealing techniques (27). We have verified that the accuracy 
of M is very high by showing that the temporal evolution of the modes 
is reproduced well and that the reconstructed gene expression patterns are 
virtually indistinguishable from the experimental data. The singular values 
are spread out and the amplitudes of the modes decrease as one considers 
higher order modes (25). This fact implies that the influence of the dominant 
modes on the other modes is generally small. Interestingly, for the cdcl5 
and sporulation data sets, the converse is also true and the dominant modes 
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are not strongly impacted by the other modes, especially when one takes 
into account the lower amplitudes of the higher order modes. This finding 
suggests that a few-mode approximation ought to be excellent for these two 
cases. 

Once the matrix M characterizing the interrelationship between the r 
modes is determined, it is a simple matter to deduce a matrix that similarly 
describes the interactions between any other set of r linearly independent 
profiles. Specifically, one can straightforwardly determine the interrelation- 
ships between r clusters of genes. As an example, consider the sporulation 
data (14) which is characterized by r=6. The problem of deriving the time 
translation matrix is underdetermined if the number of clusters exceeds six 
and then there is no unique solution. When the number of clusters is less 
than six there is no guarantee that there exists even one solution. We there- 
fore consider six clusters (metabolic, early I, early II, middle, mid-late and 
late), excluding the early-mid cluster which forms the least coherent group. 
The average expression patterns of the six clusters (c\, . . . , cq) are obtained 
as averages over the genes within the cluster and can be expressed as linear 

combinations of the six modes as 

/ ci(t) \ 
c 2 (t) 

C(t)= )/ =S-Y(t) (5) 

where S is a 6 x 6 matrix. The rows of S are the components of each of the 
characteristic modes that make up the average expression pattern for the 
six clusters. The interrelationships between the cluster expression patterns 



is determined with a time translation matrix of the form 



N = S ■ M ■ S 



-i 



(6) 



so that 



C{t + At) = N- C(t) 



(7) 



The averages of the experimental measurements (circles) and the predicted 
expression patterns (lines) of the six clusters are shown in Fig. 1 and are 
in excellent agreement, confirming the accuracy of the M matrix for the 
sporulation data in Table 1. The matrix N is shown in Table 2. The 
significance of the entries in N is similar to that described earlier for M. 
That is, the matrix element Mi a describes the influence of cluster j on 
cluster i. Specifically, the coefficient M{j multiplied by the expression level 
of cluster j at time t contributes to the expression level of cluster i at time 
(t + At). A positive matrix element leads to the i'th cluster being positively 
reinforced by the j'th cluster expression level at a previous time. 

Does one need the full rxr time translation matrix to describe the gene 
expression patterns? Or is an appropriately chosen truncated time transla- 
tion matrix adequate to reconstruct the expression patterns with reasonable 
fidelity? We now consider a linear interaction model (Eq. [2]) within which 
M is a 2 x 2 matrix and only the two most important modes are used. The 
values of the four entries in the matrix M are determined using an optimiza- 
tion scheme that minimizes the cost function similar to that given in Eq. ||. 
The resulting M matrices are shown in Table 3 and a comparison of the cal- 
culated modes (solid lines) with those obtained by SVD (dashed lines) for 
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the three sets of gene expression profiles is shown in Fig. 2. It is interesting 
to compare these 2x2 matrices with the corresponding portion of the full 
matrices shown in Table 1. The two mode approximation is excellent for the 
cdcl5 data set (CF=0.05), moderate for the sporulation data set (CF=0.18) 
and not as good for the fibroblast data set (CF=0.31) as for the others. As 
noted before, the use of the full r x r time translation matrix leads to an 
exact reproduction of the data set. Not unexpectedly, the quality of the fit 
improves as the number of modes considered is increased. Figure 3 shows 
the reconstructed expression profiles starting with the initial values and us- 
ing the 2x2 time translation matrix (denoted by a), the profiles obtained 
as a linear combination of the top two modes with appropriate gene-specific 
coefficients (b) and the experimental data (c) for the three data sets. In 
all three cases, the main features of the expression patterns are reproduced 
quite well by the time translation matrix with just two modes. The 2-mode 
reconstruction of the CDC15 profiles is the most accurate of the three. 

It can be shown that, in general, a 2 x 2 time translation matrix produces 
only two types of behavior, depending on its eigenvalues. If the eigenvalues 
are real, the generated modes will independently grow or decay exponen- 
tially. When the eigenvalues are complex conjugates of each other, as they 
are for all three cases we have examined, the two generated modes are oscil- 
latory with growing or decaying amplitudes. Mathematically, the two modes 
are constrained to have the form: 

Xi(t) = c AG^ At hin(— + A) , (8) 

r 
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X 2 (t) = cGW At hin(— + A + <j>) . (9) 

r 

Both modes are described by a single time period, r, and a single growth 
or decay factor, G. Because there are four parameters in the matrix M, 
there can only be four independent attributes in the generated modes. Two 
other parameters, c and A, are determined from the initial conditions. In 
addition to r and G, we can also determine the phase difference between 
the two modes, (f>, and the relative amplitude of the two modes, A. These 
attributes can be determined from the coefficients in M using the equations 
in Table 4. Table 5 shows the four attributes for each of the three data 
sets. The self-consistency of our analysis is underscored by the fact that the 
magnitude of the growth factor, G, is close to one for all three cases, which 
is a biologically pleasing result in that the modes do not grow explosively 
or decay. For the cell cycle data, the characteristic period is about 115 
minutes. In the other two cases the data are not periodic and hence the 
best-fit periods are comparable to the duration of the measurement. For 
the yeast cell cycle data, (f), the phase difference between the top two modes 
is 90°, suggesting a simple sine-cosine relationship, as noted by Alter et al. 
(26). Indeed, this result is self-consistent. When G is equal to 1 and an 
integer number of periods is considered, orthogonality of the top two modes 
requires that the phase difference be 90°. 

In summary, we have shown that it is possible to describe genetic ex- 
pression data sets using a simple linear interaction model with only a small 
number of interactions. One important implication is that it is impossible 
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to determine the exact interactions among individual genes in these data 
sets. The problem is underdetermined because the number of genes is much 
larger than the number of time points in the experiments. Nonetheless, we 
have shown that it is possible to accurately describe the interactions among 
the characteristic modes. Moreover, an interaction model with only two con- 
nections reconstructs the key features of the gene expression in the simplest 
cases with good fidelity. Our results imply that because there are only a few 
essential connections among modes and therefore among genes, additional 
links provide redundancy in the network. 

This work was supported by an Integrative Graduate Education and Re- 
search Training grant from the National Science Foundation, Istituto Nazionale 
di Fisica Nucleare (Italy), Komitet Badan Naukowych Grant 2P03B-146-18, 
Ministero dell' Universita e della Ricerca Scientifica, National Aeronautics 
and Space Administration, and National Science Foundation Plant Genome 
Research Program grant DBI-9872629. 
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TABLE 1 



a) (cdcl5) 



0.468 


-1.032 


0.114 


-0.199 


-0.046 


0.158 


0.342 


-0.360 


-0.024 


0.264 


-0 


0.695 


0.517 


0.007 


-0.551 


-0.011 


-0.330 


-0.183 


-0.078 


-0.175 


0.190 


-0 


0.125 


0.065 


0.482 


0.811 


-0.105 


0.027 


0.165 


0.153 


0.008 


-0.543 





-0.015 


-0.030 


-0.182 


0.306 


0.543 


-0.087 


0.360 


-1.113 


-0.680 


-0.993 


-0 


0.045 


-0.004 


-0.339 


0.225 


0.498 


0.433 


-0.304 


0.276 


0.237 


0.155 


-0 


0.007 


0.027 


-0.252 


-0.017 


-0.120 


-0.321 


0.628 


-0.159 


0.420 


0.195 





0.002 


-0.034 


-0.104 


0.061 


0.005 


-0.366 


-0.299 


0.145 


-0.839 


0.317 





0.010 


0.041 


-0.030 


0.053 


-0.370 


0.394 


-0.175 


-0.558 


-0.093 


0.559 





0.016 


-0.005 


-0.112 


-0.032 


-0.214 


0.355 


0.254 


0.291 


-0.349 


-0.499 





0.011 


-0.022 


-0.087 


-0.009 


-0.200 


-0.139 


-0.426 


-0.111 


0.310 


-0.535 





-0.019 


0.002 


-0.075 


0.057 


-0.192 


-0.105 


0.030 


0.069 


-0.185 


-0.071 


-0 



b) (speculation) 

" 0.975 -0.366 -0.431 -0.140 -0.076 0.143 " 

0.096 0.734 -0.636 -0.186 -0.032 -0.143 

-0.223 -0.386 -0.090 -0.650 -0.482 -0.417 

' -0.086 -0.059 -0.396 0.587 -0.482 -0.046 

0.098 -0.009 -0.165 0.640 1.223 0.336 

0.002 0.035 0.590 0.182 -0.576 -0.965 . 

c) (fibroblast) 



M = 



0.760 


0.313 


0.334 


-0.116 


-0.732 


-1.389 


-0.954 


-0.456 


0.199 


0.290 


-0 


0.427 


0.508 


-0.525 


0.884 


0.783 


0.142 


1.880 


-0.517 


-0.155 


-0.678 


2 


-0.091 


0.483 


0.884 


-0.199 


-0.207 


1.332 


-1.023 


-0.359 


-1.834 


0.653 


-1 


-0.113 


0.251 


0.014 


0.055 


0.253 


0.840 


-1.024 


0.779 


-0.263 


0.221 


-1 


0.012 


-0.057 


0.525 


-0.317 


0.281 


0.820 


-0.051 


0.284 


-0.422 


0.274 


-1 


0.042 


0.157 


0.303 


-0.317 


-0.415 


0.509 


-0.219 


-0.722 


-0.067 


-0.002 


-0 


-0.019 


0.074 


0.092 


-0.724 


-0.665 


-0.192 


0.478 


-0.076 


0.542 


-0.333 


-0 


0.114 


0.085 


-0.108 


0.183 


-0.187 


0.510 


-0.109 


0.165 


-0.349 


0.256 


-0 


0.074 


0.081 


0.300 


-0.435 


-0.122 


-0.048 


-0.187 


-0.789 


-0.054 


-0.280 


-0 


-0.132 


-0.154 


-0.101 


0.119 


0.163 


-0.859 


0.044 


-0.289 


1.998 


0.004 


-0 


0.057 


0.044 


0.155 


-0.091 


0.038 


0.383 


-0.148 


-0.447 


-0.343 


0.139 





-0.013 


-0.050 


0.072 


0.267 


-0.084 


0.223 


-0.265 


0.071 


-0.201 


0.122 
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TABLE 2 



(sporulation groups) 



N 



2 


233 


-3.570 


0.182 


-1 


722 


-0.440 


-0.655 


1 


913 


-1.921 


0.509 





118 


-0.356 


-0.287 


-0 


707 


3.949 


1.219 


2 


638 


1.175 


0.707 


-1 


157 


0.422 


-0.421 


-0 


525 


-0.169 


-0.130 


-0 


905 


0.954 


-0.640 


-0 


515 


0.823 


-0.057 


-1 


294 


0.699 


0.212 


-1 


232 


1.014 


0.635 



TABLE 3 



a) (cdcl5) 



b) (sporulation) 



c) (fibroblast) 



M 



M 



M = 



0.469 
0.621 



1.078 
0.214 



0.941 
0.110 



-1.283 
0.468 



-0.342 
0.812 



-0.045 
1.033 



TABLE 4 



M 



a b 
c d 



G = V ad — be 



A = ^J-b/c 
2vrAt 



cos 



-1 / a+d\ 
\ 2G ) 
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/ a- d \ 

6 = COS ; 



TABLE 5 





G 


A 


r 




cdcl5 


1.008 


1.437 


115 minutes 


90.0° 


sporulation 


0.974 


1.264 


12.8 hours 


60.6° 


fibroblast 


0.988 


0.640 


29.2 hours 


130.8° 
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FIGURE CAPTIONS 



Figure 1. A comparison of measured and calculated expression profiles. Av- 
erage expression profiles for the six clusters of genes in the sporulation data 
set (14) are represented by circles and the approximated values calculated 
using the best-fit time translation matrix are shown as lines. 

Figure 2. The first two characteristic modes for the a) cdcl5, b) sporulation 
and c) fibroblast data sets. The circles correspond to the measured data and 
the lines show the approximations based on the best-fit 2x2 time translation 
matrices. 

Figure 3. A reconstruction of the expression profiles for the cdcl5 (first 
three panels), sporulation (middle three panels), and fibroblast (last three 
panels) data sets. For each set, panel a shows the results obtained using the 
2x2 time translation matrix to determine the temporal evolution of the 
expression profiles from their initial values, panel b shows expression levels 
expressed as linear combinations of just the two top modes, whereas panel 
c shows the experimental data. 
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